Linear Regression: Part 1

| include: false

What does the line above do?

The line above sets a chunk option: the code in the chunk still runs, but neither the code nor its output appears in the rendered document. The practical reason for this is to keep the document clean. (It suppresses the output from the chunk but still runs it.)
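As an illustration (the exact packages are an assumption, inferred from the functions used later in this document), a setup chunk using this option might look like:

```r
#| include: false
# This chunk runs when the document renders, but neither the
# code nor its messages appear in the rendered output.
library(tidyverse)   # provides read_csv() and ggplot()
library(ggcorrplot)  # provides ggcorrplot() for the correlation plot
```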

We are working with a dataset from a fictional coffee shop. The data is contained in a CSV file (already uploaded in the Files pane at right). Let’s load the data and take a look at what we have.

Linear Regression

Linear regression is a data analysis technique that predicts the value of unknown data using another, related, known data value.

shop = read_csv("coffee_shop.csv")
Rows: 50 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): Week, Sales, Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Examine the dataset

str(shop)
spc_tbl_ [50 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Week  : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
 $ Sales : num [1:50] 275 293 382 476 438 ...
 $ Rating: num [1:50] 9.2 8.8 8.7 9.3 9.5 9.5 8.6 8.8 9.2 9.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   Week = col_double(),
  ..   Sales = col_double(),
  ..   Rating = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

This is a straightforward dataset with three numeric variables: Week, Sales, and Rating.

summary(shop)
      Week           Sales            Rating     
 Min.   : 1.00   Min.   : 274.8   Min.   :7.700  
 1st Qu.:13.25   1st Qu.: 818.8   1st Qu.:8.725  
 Median :25.50   Median :1457.6   Median :9.000  
 Mean   :25.50   Mean   :1463.7   Mean   :9.010  
 3rd Qu.:37.75   3rd Qu.:2045.8   3rd Qu.:9.300  
 Max.   :50.00   Max.   :2667.2   Max.   :9.800  

Our typical data-cleaning tasks of type conversion and handling missing data are not necessary here: the variables are already the correct type, and there is no missing data (how do we know this?).
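As a quick sketch of how we know: str() already confirmed all three columns are numeric, and we can count missing values directly.

```r
# Count NA values in each column; all zeros means no missing data
colSums(is.na(shop))

# Or one total for the whole data frame
sum(is.na(shop))
```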


Let’s take a look at linear regression. The basic idea is that we want to model the linear relationship between two variables. Later, we’ll look at situations where there are more than two variables. We’ll define one of our variables as the response variable: the variable that we are trying to predict. In practice, this is typically defined for us before we start to attack a problem. Some or all of the other variables in the dataset may then be predictor variables. We assume that there is a linear relationship between the predictor variables and the response variable.

Before we start, the following must be true (for linear regression):

  • The response variable MUST be numeric (Y: always numeric)

  • The predictor variables can be numeric or categorical

We assume that the relationship between our predictor variable and the response variable can be modeled with the following relationship (for a single predictor variable model):

\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1\)

In this model, the \(y\) variable is the response, the \(x\) variable is the predictor, \(\beta_0\) is the y-intercept (the point where the regression line crosses the y axis), and \(\beta_1\) is the slope of the regression line. Recall from algebra that knowing the slope and y-intercept is sufficient to draw a line.

Best line = Best fit

x - ind. var = predictor

y - dep. var = response

Linear Regression Line (Source: AnalystPrep)

There are several assumptions that must be true in order for us to successfully build a linear regression model:

  • Linear relationship between \(x\) and \(y\)

    • LOOKS LIKE A LINE :))
  • Regression model errors are independent. Error is the difference between a predicted value and the actual value. The direction and size of the “next” error in a model should not be predictable.

    • ERRORS NEED TO BE INDEPENDENT
  • The variance of errors is constant. We’ll explore this further later.

    • DOES IT LOOK LIKE A FUNNEL? THE SPREAD OF THE ERRORS SHOULD STAY CONSTANT (A FUNNEL SHAPE MEANS IT DOES NOT)
  • Errors should be Normally distributed. We’ll explore this further later.

    • “NORMALLY” DISTRIBUTED … BELL CURVE?

Our workflow for linear regression looks like this:

  • Import the data and familiarize ourselves with the dataset (be sure that we can identify which variable in the dataset is the response variable and which variable(s) will serve as predictors).

  • Clean and prepare the data (e.g., type conversions, missing data, etc.).

  • Visualize the relationship between each potential predictor variable and the response variable.

  • Calculate correlation between each NUMERIC predictor and the response variable.

  • Build and evaluate linear regression models.

  • Select a “champion” model or set of models.


In our coffee shop dataset, which variable is the response?

Sales is the response variable; as we’ll see, Week has a strong, positive, LINEAR relationship with it.

Let’s start our workflow with visualization.

ggplot(shop, aes(x = Week, y = Sales)) + geom_point() + theme_bw()

How would we describe the relationship between these variables?

Strong, positive, and LINEAR relationship.

ggplot(shop, aes(x = Rating, y = Sales)) + geom_point() + theme_bw()

How would we describe the relationship between these two variables?

Weak and slightly negative; not LINEAR. (Week and Sales are strongly correlated at 0.998; Rating and Sales are weakly correlated at -0.115.)

Correlation ranges from -1 to 1: values near -1 or 1 indicate a nearly perfect line (in either direction), and 0 means no linear correlation.
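For reference, cor() also works on a single pair of variables; this is where the numbers in the notes above come from (values rounded):

```r
cor(shop$Week, shop$Sales)    # 0.998: strong, positive, linear
cor(shop$Rating, shop$Sales)  # -0.115: weak, slightly negative
```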

Next in our workflow, we look at correlation of the numeric predictors (or potential predictors) with the response. What does this plot tell us?

It tells us the strength and direction of the linear relationship between each pair of variables.

corr = cor(shop) #develop basic correlation matrix
ggcorrplot(corr, lab = TRUE, digits = 3)

It’s pretty clear that Week has a STRONG, positive, linear relationship with Sales. Rating has a much weaker relationship (slightly negative).

BIG NOTE: Correlation measures the strength of LINEAR relationship between two variables. It DOES NOT measure that strength well if the relationship is NON-LINEAR!

Correlation Examples

We can now build our linear regression models. Let’s build a model with Week to predict Sales.

model1 = lm(Sales ~ Week, shop)

#lm = linear model
#Sales is our response variable
#Week is our predictor variable
#shop is the name of our dataset
summary(model1)

Call:
lm(formula = Sales ~ Week, data = shop)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.190 -33.055  -1.365  29.513 107.730 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 203.8136    13.3103   15.31   <2e-16 ***
Week         49.4083     0.4543  108.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 46.35 on 48 degrees of freedom
Multiple R-squared:  0.996, Adjusted R-squared:  0.9959 
F-statistic: 1.183e+04 on 1 and 48 DF,  p-value: < 2.2e-16

Regression Line

Sales = 203.8 + 49.4*Week

Pr(>|t|) = <2e-16

This means the Week variable is a significant predictor of Sales😊

The positive coefficient (49.4) means that as Week increases, Sales increase as well.

Multiple R-squared: 0.996 - a measure of model quality (closer to 1 is better)
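To use the fitted line for prediction, we can plug a week into the equation by hand or call predict(); Week = 52 is just an illustrative value, not a week from the dataset.

```r
# Manual prediction from the regression equation
203.8 + 49.4 * 52  # roughly 2773 in Sales

# The same prediction from the model object (uses unrounded coefficients)
predict(model1, newdata = data.frame(Week = 52))
```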

Let’s talk our way through the output. It looks complicated, but there are only a few items that we really need to pay attention to.

Next, we look at our regression diagnostics. The first plot puts the predicted values on the x axis and the residuals on the y. This plot should look like 1) a random scattering of points with no discernible patterns and 2) the spread (variance) of residuals should not change as we work our way from left to right on the plot. Does this happen in this plot?

No pattern ✔️ & constant spread ✔️

ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) + theme_bw()

Here’s an example of patterned residuals (Source: ISLR textbook):

Patterned Residuals

Here’s an example of non-constant variance of residuals (Source: ISLR textbook):

Non-Constant Variance of Residuals

Next is a plot of the distribution of residuals. We can use a histogram for this. Ideally, this histogram should look like a Normal distribution.

  • Week is a significant predictor of Sales

  • positive relationship

  • the R-squared value is very close to 1

ggplot(model1, aes(x = .resid)) +
  geom_histogram(binwidth = 20) + #experiment with binwidth to get "good" histogram
  theme_bw()
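As a supplemental check (not part of the original workflow), a Q-Q plot is another common way to assess Normality of the residuals: if they are roughly Normal, the points fall close to the reference line.

```r
# Q-Q plot of the model1 residuals
ggplot(model1, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_bw()
```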


Let’s repeat this analysis, but for Rating.

model2 = lm(Sales ~ Rating, shop)
summary(model2)

Call:
lm(formula = Sales ~ Rating, data = shop)

Residuals:
     Min       1Q   Median       3Q      Max 
-1209.94  -691.92    19.87   570.85  1205.98 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3150.1     2110.6   1.493    0.142
Rating        -187.2      234.0  -0.800    0.428

Residual standard error: 724.4 on 48 degrees of freedom
Multiple R-squared:  0.01316,   Adjusted R-squared:  -0.007403 
F-statistic: 0.6399 on 1 and 48 DF,  p-value: 0.4277

How do we feel about this model? How does it differ from the previous model?

* R-squared is awful - near 0 (0.013)

* Rating is not a significant predictor of Sales (p = 0.428)


We can extend our regression model to have multiple predictors. We’ll go into this in more detail in the future, but here’s the basic idea. The regression equation expands to incorporate more than one predictor variable:

\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... + \hat{\beta}_n x_n\)

The lm model is easy to change:

model3 = lm(Sales ~ Week + Rating, shop)
summary(model3)

Call:
lm(formula = Sales ~ Week + Rating, data = shop)

Residuals:
    Min      1Q  Median      3Q     Max 
-84.911 -32.618  -2.136  24.873 105.671 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 116.1588   138.8182   0.837    0.407    
Week         49.4436     0.4605 107.370   <2e-16 ***
Rating        9.6288    15.1779   0.634    0.529    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 46.65 on 47 degrees of freedom
Multiple R-squared:  0.996, Adjusted R-squared:  0.9958 
F-statistic:  5841 on 2 and 47 DF,  p-value: < 2.2e-16

What’s the interpretation of this model?

NOTE: For a “fair” comparison of models with differing numbers of variables, use the adjusted R-squared value. This value adjusts for the number of variables in the model.
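For example, the adjusted R-squared values for our one- and two-predictor models can be pulled straight from the model summaries; the values below come from the output shown earlier.

```r
summary(model1)$adj.r.squared  # 0.9959 (Week only)
summary(model3)$adj.r.squared  # 0.9958 (Week + Rating): adding Rating doesn't help
```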